Document Summary:
This is an R notebook for the Fundulus heteroclitus clinal genetics analysis.
Top level summaries of methods and results are given in the main text. More detailed explanations of each step are provided as comments in a header or footer within each code chunk.
Code chunks in R, python and bash are generally run locally, while shell scripts are nearly always run remotely on the Clark University high performance computing cluster.
Project Rationale:
Combine raw data from all F. heteroclitus GBS sequencing projects in the Oleksiak/Crawford lab into a single genotyping run. Then use these data to describe the population genetic structure of the entire F. heteroclitus species range and to parse adaptive genetic variation from neutral structure owing to secondary recontact or isolation by distance.
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
# edge definitions with the node IDs
tab1 -> tab2 -> tab3 -> tab4 -> tab5 -> tab6 -> tab7
tab6 -> tab8
}
[1]: 'library prep (Elshire GBS)'
[2]: 'sequencing (illumina hiseq2000)'
[3]: 'qc checks and datawrangling (fastqc and bash)'
[4]: 'demultiplex and read filtering (stacks: process_radtags)'
[5]: 'read mapping (bwa_mem)'
[6]: 'genotype likelihoods (ANGSD)'
[7]: 'demography and population structure'
[8]: 'selection'
")
Sequencing and Genotyping
Genotyping by sequencing libraries are produced according to Elshire et al. 2011 and sequenced using Illumina HiSeq. After initial quality control, the reads are cleaned and assigned to individual fish using the process_radtags program from STACKS. The cleaned reads are mapped to the Fundulus heteroclitus genome using BWA-MEM. The resulting BAM files are then used as input to calculate genotype likelihoods in ANGSD.
Population Genetics
All population genetic analyses are conducted from the genotype likelihoods. A summary of these methods is available in the relevant sections below.
Samples
sites<-read.table("./sites/sites.txt", header = T, sep = "\t")
kable(sites)
| Population_Full_Name | Abbreviation | N | Longitude | Latitude |
|---|---|---|---|---|
| Brayton Point, MA | BP | 50 | -71.18604 | 41.71250 |
| Clinton, CT | CT | 24 | -72.52448 | 41.27902 |
| Sapelo Island Ferry Dock, Brunswick, GA | GA | 41 | -81.35452 | 31.44932 |
| Horseneck Beach, MA | HB | 50 | -71.02556 | 41.50449 |
| Kings Creek, Severn, VA | KC | 25 | -76.42462 | 37.29845 |
| Mattapoisett, MA | M | 31 | -70.81464 | 41.65875 |
| Mount Desert Island Bio Lab | MDIBL | 24 | -68.29441 | 44.42901 |
| Wiscasset, ME | ME | 27 | -69.66481 | 43.99695 |
| Mantoloking, NJ | MG | 331 | -74.07045 | 40.05046 |
| New Bedford Harbor | NBH, F, H, P, Syc | 150 | -70.90760 | 41.62486 |
| Manteo, NC | NC | 24 | -75.66737 | 35.90029 |
| Oyster Creek, NJ | OC | 47 | -74.18475 | 39.80866 |
| Rutgers, Tuckerton, NJ | RB | 423 | -74.32448 | 39.50878 |
| Charleston, SC | SC | 26 | -79.91624 | 32.75873 |
| Stone Harbor | SH | 62 | -74.76492 | 39.05666 |
| Slocum River, MA | SLC | 30 | -70.97109 | 41.52257 |
| Succotash Marsh, Matunuck, RI | SR | 49 | -71.52557 | 41.38235 |
| Wilmington, NC (Carolina Beach State Park) | WNC | 30 | -77.91932 | 34.04998 |
| grandis | grandis | 34 | -84.17740 | 30.07200 |
#chunk to pull map
The clinal dataset is composed of 9 lanes of HiSeq2500 run in SE100. These reads cover 1478 individuals gathered from 5 separate experiments and an additional set of sequencing libraries used for regions undersampled in the previous experiments (CT, NC, SC and ME). All lanes use the same cut sites but vary in depth and quality of library preparation (some libraries were made from very old samples).
This variation in sequencing depth and quality produced serious issues when calling genotypes using TASSEL. Because separate experiments were combined into a new analysis, populations were not distributed across unique library preparations and sequencing runs, leaving variation in sequencing depth confounded with population. There was a strong inverse correlation between population-level sequencing depth and observed heterozygosity, suggesting that heterozygotes were falsely called as homozygotes in undersampled populations. To ameliorate the variation in library prep and sample quality, this analysis does not rely on called genotypes. Instead we use ANGSD to generate genotype likelihoods and use these likelihoods to conduct all downstream analyses.
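The depth-heterozygosity confound has a simple sampling explanation: at a true heterozygous site sequenced to depth d, the chance that every read happens to sample the same allele is 2 x (1/2)^d, in which case a naive caller reports a homozygote. A minimal sketch of this (hypothetical depths, ignoring sequencing error; not part of the pipeline):

```python
def p_false_homozygote(depth):
    """Probability that all `depth` reads at a true heterozygous site
    sample the same allele, so a naive genotype caller would report
    a (false) homozygote. Ignores sequencing error."""
    # two ways to be all-one-allele, each with probability (1/2)^depth
    return 2 * (0.5 ** depth)

# the false-homozygote rate drops quickly with depth
for d in (2, 4, 8):
    print(d, p_false_homozygote(d))
```

At 2x depth half of all heterozygotes are miscalled, which is why undersampled populations show depressed heterozygosity under hard genotype calling.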
Other atypical challenges associated with this dataset are the inclusion of the sister species F. grandis in the sequencing data and unequal sampling across space and (potentially) lineages.
Lane names:
H7T6MADXX_1 H7T6MADXX_2 H9AREADXX_1 H9AREADXX_2 H9CWFADXX_1 H9WYGADXX_1 H9WYGADXX_2 HMMKLADXX_1 HMMKLADXX_2
Here we run sequencing-level QC reports for each lane using FastQC before beginning the analysis.
#!/bin/bash
# set the number of nodes
#SBATCH --nodes=1
# set the number of cpus
#SBATCH --cpus-per-task=6
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-02:00:00
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=fastqc
# set name of output file
#SBATCH --output=fastqc.out
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
files="H7T6MADXX_1
H7T6MADXX_2
H9AREADXX_1
H9AREADXX_2
H9CWFADXX_1
H9WYGADXX_1
H9WYGADXX_2
HMMKLADXX_1
HMMKLADXX_2
"
for file in $files
do
/home/jgibbons/SOFTWARE/FastQC/fastqc -t 4 /home/ddayan/fundulus/seqdata/${file}_fastq.gz -o /home/ddayan/fundulus/seqdata/fastqc
done
FastQC reports look good for all lanes.
We will use ANGSD to create population-level allele frequency estimates rather than call individual genotypes. The input files for ANGSD are individual-level BAM files. Therefore the first step after the FastQC check is to demultiplex (assign reads to an individual), remove low quality reads, and trim adapter and barcode sequences. This is accomplished with the process_radtags program from stacks v2.3
Barcode keys (files used to demultiplex sequence data) are a bit complex given that some fish are sequenced twice and stacks requires each lane be run separately through process_radtags. The solution taken here is to give duplicate fish a unique ID suffix with a shared prefix; cleaned, demultiplexed reads from each fish in each lane are then combined with their additional reads using prefix matching.
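The suffix-then-prefix-match idea can be sketched as follows (hypothetical fish IDs; the real keys are built with R's make.names in the chunk below):

```python
from collections import Counter, defaultdict

def uniquify(ids):
    """Give repeated IDs a numeric suffix (first occurrence unchanged),
    mimicking R's make.names(..., unique = TRUE)."""
    seen = Counter()
    out = []
    for i in ids:
        n = seen[i]
        out.append(i if n == 0 else "{}.{}".format(i, n))
        seen[i] += 1
    return out

def group_by_prefix(unique_ids):
    """Recover the duplicate groups by stripping the suffix again,
    so cleaned reads from both runs of a fish can be merged."""
    groups = defaultdict(list)
    for u in unique_ids:
        groups[u.split(".")[0]].append(u)
    return dict(groups)

ids = ["BP_101", "BP_102", "BP_101"]   # BP_101 sequenced in two lanes
u = uniquify(ids)                      # ['BP_101', 'BP_102', 'BP_101.1']
print(group_by_prefix(u))
```

Each lane can then be demultiplexed independently with the unique IDs, and the per-fish read files merged afterward by shared prefix.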
#Here we make new fish_ids, based on only population and library prep id, then mark duplicates (duplicate fish_ids occur because some individual fish are sequenced twice i.e. in different lanes)
barcodes<-read.table("./snp_calling/barcode_keys/cline_barcode_key.txt", sep = "\t", header = T)
tmp<-paste(barcodes$Pop , "_", barcodes$LibraryPrepID, sep = "")
barcodes$fish_id<-make.names(tmp, unique = TRUE)
write.csv(barcodes, row.names = FALSE, file = "./snp_calling/barcode_keys/cline_barcode_key.csv", quote = FALSE)
# used a python script on hand for a slightly different format: create a unique single variable for sequencing lane (i.e. column-bind flowcell and lane with a _), then run the python script below
"""The input file has four columns; this script writes columns 2 and 3 (barcode and individual) to a new file based on the value of column 4."""
import csv

with open('./snp_calling/barcode_keys/cline_barcode_key.csv') as fin:
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Flowcell_Lane']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('./snp_calling/barcode_keys/{}_key.csv'.format(cat), 'w')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)

# Close all the files
for fout, _ in outputs.values():
    fout.close()
# Oops, only meant to keep barcode and sample id and need to write to tab delimited file instead.
for i in ./snp_calling/barcode_keys/*csv
do
cut -d "," -f 2,10 $i > ${i%.csv}.tmp
done
for i in ./snp_calling/barcode_keys/*tmp
do
tr "," "\\t" < $i > ${i%.tmp}_barcodes.txt
done
#clean up directory
rm ./snp_calling/barcode_keys/*.tmp
rm ./snp_calling/barcode_keys/*[12]_key.csv
#remove first line from all files
sed -i -e "1d" ./meta_data/*_barcodes.txt
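Before running process_radtags, it is worth verifying that no barcode appears twice within a lane's key, since a duplicate barcode makes demultiplexing ambiguous. A hypothetical check (the helper and example pairs are illustrative, not part of the pipeline):

```python
from collections import Counter

def duplicate_barcodes(rows):
    """Given (barcode, sample) pairs for ONE lane, return any barcodes
    assigned to more than one sample -- these would make demultiplexing
    ambiguous within that lane."""
    counts = Counter(bc for bc, _ in rows)
    return sorted(bc for bc, n in counts.items() if n > 1)

# toy lane key: ACGT is (incorrectly) assigned to two fish
lane = [("ACGT", "BP_101"), ("TGCA", "BP_102"), ("ACGT", "CT_7")]
print(duplicate_barcodes(lane))
```

The same barcode appearing in different lanes is fine; only within-lane duplicates are a problem.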
Now that all reads have been demultiplexed and cleaned, and duplicate sequencing runs for individuals have been merged, we align to the F. heteroclitus genome and save results as individual level BAM files.
Summary
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=bwa_default
# set name of output file
#SBATCH --output=bwadefault.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
for i in `find ../cleaned_tags/ -name '*.gz'`
do
/opt/bio-bwa/bwa mem -t 38 ../genome/f_het_genome_3.0.2.fa.gz ../cleaned/${i} | /opt/samtools-1.6/bin/samtools view -@ 16 -bSu - | /opt/samtools-1.6/bin/samtools sort -@ 16 - -o ./${i:0:-6}.$
done
Next is a QC check on the mapping process.
What are the depths of mapped reads (from bam files)?
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=1
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bam_depths
# set name of output file
#SBATCH --output=bam_depths
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in *bam
do
$samtools depth $i | awk -v var="$i" '{sum += $3; n++} END {print " '${i%.bam}' ", sum / n}' >> depths.txt
done
bam_depths<-read.table("./snp_calling/QC/depths.txt", sep = " ")
bam_depths<-bam_depths[,-c(1,3)]
colnames(bam_depths)<-c("fish_id", "depth")
ggplot(data = bam_depths)+geom_density(aes(x = depth))+labs( x = "Sequencing depth", y = "Density")
Read Mapping Fig. 1: Mean depth of mapped reads in BAM files, averaged across mapping loci and individuals
#we could also easily split up the depth file here to look at population level variation in depth
Most individuals have greater than 10x sampling depth on average across all loci.
However, the number of mapped loci may vary across individuals, so let’s also look at the overall number of mapped reads per sample for a better idea of the distribution of reads per individual in the dataset.
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=1
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bam_mapped_reads
# set name of output file
#SBATCH --output=bam_mapped_reads.out
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in *bam
do
{ $samtools view -c -F 4 $i; echo ${i%.bam} ; } | tr "\n" " " >> mapped_reads.txt
done
mapped_reads<-read.table("./snp_calling/QC/mapped_reads.txt", sep = " ")
ggplot(data = mapped_reads)+geom_histogram(aes(x = log10(mapped_reads[,1])))+labs(x = "log10 reads")
Read Mapping Fig. 2: Total number of reads per individual
Read Mapping fig. 2 highlights the need for a genotype likelihood based approach. Even excluding outlier individuals, the sampling depth varies 20-50 fold among samples. This figure also shows that Read Mapping fig. 1 is misleading: the similarity in read depth depicted in Read Mapping fig. 1 might be caused by dropout of loci in individuals with low depth. Also of note is the roughly 5% of individuals with extremely poor coverage (less than ~50000 reads).
F. grandis
Keeping grandis in the analysis may be useful to reconstruct the ancestral state of the genome and therefore create an unfolded site frequency spectrum (SFS), but first we need to rationalize maintaining them in the genotyping run.
How well do the grandis reads align to the fundulus reference genome? Let’s compare the bam files:
colnames(mapped_reads) <- c("reads", "sample")
mapped_reads$pop<-str_extract(mapped_reads$sample, "[^_]+")
#mapped_reads %>%
# group_by(pop) %>%
# summarize(mean = mean(reads), sd = sd(reads), n = n())
# can uncomment this if a reviewer wants this picture of the variation later
ggplot()+geom_density(data = mapped_reads[(mapped_reads$pop!="grandis"),],aes(x=log10(reads)))+geom_density(data = mapped_reads[(mapped_reads$pop=="grandis"),],aes(x=log10(reads)), color = "red") + labs(x = "Log10 Reads")
Read Mapping Fig. 3: Total number of reads (log10) for F. heteroclitus (black) vs F. grandis
This looks great: a similar number of reads map to the genome regardless of the species the reads originate from. This suggests that BWA-MEM can tolerate the mismatches owing to species-level variation well enough to map the reads.
Raw read counts aren’t the ideal comparison, though; let’s figure out the proportion of reads successfully mapped to the reference genome.
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=1
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bam_mapped_reads
# set name of output file
#SBATCH --output=bam_mapped_reads.out
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in *bam
do
{ $samtools view -c -F 0 $i; echo ${i%.bam} ; } | tr "\n" " " >> total_reads.txt
done
#this script pulls the total reads (not just mapped reads) from the bam files using samtools with bitwise flag 0
#needed to fix the output quickly using regex - replaced every second space with a linebreak
#make new dataframe for proportion of reads and plot
total_reads <- read.table("./snp_calling/QC/total_reads.txt", sep = " ")
# make column of total reads, then proportion reads, plot again
colnames(total_reads) <- c( "total", "sample")
mapped_reads <- cbind(mapped_reads, total_reads$total)
colnames(mapped_reads)[4] <- c("total_reads")
mapped_reads$proportion <- mapped_reads$reads/mapped_reads$total_reads
ggplot()+geom_density(data = mapped_reads[(mapped_reads$pop!="grandis"),],aes(x=log10(proportion)))+geom_density(data = mapped_reads[(mapped_reads$pop=="grandis"),],aes(x=log10(proportion)), color = "red") + labs(x = "Log10 Proportion Mapped")
Read Mapping Fig. 4: Total proportion of mapped reads (log10) for F. heteroclitus (black) vs F. grandis
A similar proportion of reads map regardless of species, suggesting that the divergence between the species is not so high that including grandis will cause major issues.
ANGSD Rationale
ANGSD is used to generate genotype likelihoods from the demultiplexed, filtered reads.
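For intuition, a genotype likelihood is just the probability of the observed reads given each possible genotype. A minimal sketch with a simplified per-read error model (toy reads and error rate; this is not ANGSD's exact -GL 1 SAMtools implementation):

```python
def genotype_likelihoods(reads, ref, alt, err=0.01):
    """Return unnormalized likelihoods (L_RR, L_RA, L_AA) for a biallelic
    site, multiplying per-read probabilities under each genotype.
    Assumes errors hit the other 3 bases uniformly."""
    def p_read(base, genotype):
        # average over the two chromosomes of the genotype
        probs = [(1 - err) if base == allele else err / 3
                 for allele in genotype]
        return sum(probs) / 2

    likes = []
    for g in ((ref, ref), (ref, alt), (alt, alt)):
        L = 1.0
        for b in reads:
            L *= p_read(b, g)
        likes.append(L)
    return tuple(likes)

# 3 'A' reads and 1 'G' read: the heterozygote is most likely,
# but the homozygous-ref genotype retains non-trivial likelihood
print(genotype_likelihoods("AAAG", "A", "G"))
```

Downstream methods then integrate over these likelihoods rather than committing to a single (possibly wrong) genotype call, which is what makes the approach robust to the depth variation described above.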
ANGSD Analysis outline
We conduct two sets of ANGSD analyses. The first calculates genotype likelihoods (flowchart below).
grViz("digraph flowchart {
# node definitions with substituted label text for placing code
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
tab9 [label = '@@9']
# edge definitions with the node IDs
tab1 -> tab3;
tab1 -> tab2;
tab3 -> tab4 -> tab5
tab4 -> tab6
tab3 -> tab7
tab3 -> tab8
tab9 -> tab1
}
[1]: 'ANGSD -GL 1 -doMaf -doMajorMinor -glf 2'
[2]: 'Allele Frequencies'
[3]: 'beagle format GLs'
[4]: 'covariance matrix (PCangsd)'
[5]: 'PCA (R)'
[6]: 'RDA/other multivariate EAAs (R)'
[7]: 'LD (ngsLD)'
[8]: 'Admixture (ngsAdmix)'
[9]: 'bam files'
")
This analysis produces population-level allele frequencies and genotype likelihoods (GLs). These genotype likelihoods are used to calculate linkage disequilibrium, admixture proportions, and a genetic covariance matrix. The covariance matrix is used as the basis for a further population structure analysis (principal component analysis) and an environmental association analysis (redundancy analysis) to parse adaptive from neutral genetic variation.
The second ANGSD analysis is a separate project, but is briefly summarized here (flowchart below).
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
# edge definitions with the node IDs
tab1 -> tab2
tab2 -> tab3
tab3 -> tab4 -> tab5
tab3 -> tab6
tab3 -> tab7
tab8 -> tab1
}
[1]: 'ANGSD -GL 1 -doSAF unfolded'
[2]: 'SAF'
[3]: 'SFS (realSFS)'
[4]: 'Dadi conversion '
[5]: 'moments/Dadi'
[6]: 'theta, tajD'
[7]: 'FST'
[8]: 'bam files'
")
In this second analysis we calculate site allele frequencies (SAF) from the genotype likelihoods, then use the SAF to generate site frequency spectra. This SFS can be used for demographic modelling to examine the hypothesis that contemporary F. heteroclitus population genetic structure was shaped by post-glacial secondary recontact between the main population and a northern glacial refugium population. Secondarily, the SFS can be used to calculate site-wise FST and other site-wise stats (theta and Tajima’s D) and may be included in the main analysis if an outlier approach to identifying adaptive genetic variation is desirable.
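The ancestral state from grandis matters because without it the SFS must be folded onto minor-allele frequencies, discarding information that demographic models can use. The fold operation itself is simple (toy counts, not from this dataset):

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS of length 2N+1 (counts of sites at derived-allele
    frequency 0..2N) into a minor-allele-frequency spectrum by summing
    mirrored entries; the central entry (if any) is not doubled."""
    n = len(unfolded) - 1          # 2N chromosomes
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded

# 2N = 4 chromosomes, derived-allele counts 0..4
print(fold_sfs([100, 30, 12, 5, 2]))
```

realSFS handles this internally; the sketch just shows why frequency classes i and 2N-i become indistinguishable when the ancestral allele is unknown.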
In this section we set up file directories, create metadata files and manipulate input files for the ANGSD run.
Create lists of bams for all the separate populations
#in server directory containing bams
# for each population create a list of paths to bam files
find "$(cd ..; pwd)" -iname "BP*bam" > ../BP.bams
find "$(cd ..; pwd)" -iname "CT*bam" > ../CT.bams
find "$(cd ..; pwd)" -iname "GA*bam" > ../GA.bams
find "$(cd ..; pwd)" -iname "HB*bam" > ../HB.bams
find "$(cd ..; pwd)" -iname "KC*bam" > ../KC.bams
find "$(cd ..; pwd)" -iname "M_*bam" > ../M.bams
find "$(cd ..; pwd)" -iname "MDIBL_*bam" > ../MDIBL.bams
find "$(cd ..; pwd)" -iname "ME_*bam" > ../ME.bams
find "$(cd ..; pwd)" -iname "MG_*bam" > ../MG.bams
find "$(cd ..; pwd)" -iname "NBH_*bam" > ../NBH.bams
find "$(cd ..; pwd)" -iname "F_*bam" > ../F.bams
find "$(cd ..; pwd)" -iname "H_*bam" > ../H.bams
find "$(cd ..; pwd)" -iname "P_*bam" > ../P.bams
find "$(cd ..; pwd)" -iname "SYC_*bam" > ../SYC.bams
find "$(cd ..; pwd)" -iname "NC_*bam" > ../NC.bams
find "$(cd ..; pwd)" -iname "OC_*bam" > ../OC.bams
find "$(cd ..; pwd)" -iname "RB_*bam" > ../RB.bams
find "$(cd ..; pwd)" -iname "SC_*bam" > ../SC.bams
find "$(cd ..; pwd)" -iname "SH_*bam" > ../SH.bams
find "$(cd ..; pwd)" -iname "SLC_*bam" > ../SLC.bams
find "$(cd ..; pwd)" -iname "SR_*bam" > ../SR.bams
find "$(cd ..; pwd)" -iname "WNC_*bam" > ../WNC.bams
find "$(cd ..; pwd)" -iname "grandis_*bam" > ../grandis.bams
cat *bams > ALL.bamlist
Compress the genome and create a compression index
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bgzip_index
# set name of output file
#SBATCH --output=bgzip.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
/home/ddayan/software/htslib-1.9/bgzip -i -@ 10 f_het_genome_3.0.2.fa
Create a samtools index
samtools faidx f_het_genome_3.0.2.fa.gz
Index all the bam files
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bgzip_index
# set name of output file
#SBATCH --output=bgzip.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in ./*.bam;do $samtools index $i;done
In this section we conduct an exploratory data analysis of site coverage and the sensitivity of ANGSD to site-filtering parameters.
First let’s take a look at coverage and allele frequency to get an idea of what the data look like prior to filtering. This first ANGSD run does not estimate genotype likelihoods; instead it only collects the distribution of quality scores and read counts.
ANGSD options:
command options
-P 10: use 10 threads
-ref: path to reference genome
-out: output prefix
-r: run on only one contig (to check speed)
filtering
-uniqueOnly: use only uniquely mapping reads
-remove_bads: discard bad reads
-trim 0: perform no trimming
-C: exclude reads with excessive mismatches
-baq: avoid SNPs called due to indels
-minMapQ: exclude reads with poor mapping quality
-maxDepth: read depths above this level are binned (set high at this point to look at the actual depth distribution)
dos
-doQsDist: calculates the distribution of quality scores of used reads
-doDepth: calculates the distribution of read depths
-doCounts: needed for the depth calculation
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=bwa_default
# set name of output file
#SBATCH --output=bwadefault.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
$ANGSD -P 10 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc \
-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -maxDepth 1000 \
-minMapQ 20 -doQsDist 1 -doDepth 1 -doCounts 1
Results of initial ANGSD run
Q scores and global depth:
## barplot q-scores
qs <- read.table(file = "./snp_calling/ANGSD_outputs/qc_check/ALL.qc.qs", head=T, stringsAsFactors=F)
barplot(height=qs$counts, names.arg=qs$qscore, xlab="Q-score", ylab="Counts", main = "Q-score distribution")
## global depth
dep <- as.numeric(scan(file = "./snp_calling/ANGSD_outputs/qc_check/ALL.qc.depthGlobal",what="char", quiet=T))
dep_freq<-as.data.frame(cbind(dep, seq(1:length(dep))))
colnames(dep_freq)<-c("count", "depth")
dep_freq$depth<-dep_freq$depth-1
ggplot(data = dep_freq, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")
ggplot(data = dep_freq, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(0, 100)
EDA Fig. 2: Total number of reads (across all individuals) per site. Right plot is a subset of the main plot
other salient results from outputs
Total number of sites analyzed: 567547
Number of sites retained after filtering: 559561
Initial EDA conclusions from these figures:
- Quality scores are high: this is no surprise, as the raw input data also had high quality scores
- Most sites have very limited depth: with 1480 samples, the global depth is mostly less than 1000
- 59000 sites (about 10%) have more than 1000 reads; the next step is to check whether these are driven by moderate depth in many individuals or by extremely high depth in just a few
Coverage EDA continued
This time, we do some standard filtering: only retain sites present in 80% of individuals.
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=bwa_default
# set name of output file
#SBATCH --output=bwadefault.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
$ANGSD -P 10 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc -r KN811289.1 \
-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -maxDepth 1000 -minInd 1184 \
-minMapQ 20 -doQsDist 1 -doDepth 1 -doCounts 1
#to speed things up we just run on one chromosome (the -r option), but this produced no sites, so ran again with full genome
There were only 2,702 sites with coverage in at least 80% of individuals (that also met the other filtering criteria). This roughly aligns with the results from other genotyping pipelines (i.e. stacks and TASSEL), which each found around 2-10k loci depending on filtering parameters. A question to consider down the road in the EDA is whether this is driven by poor coverage overall, dropout of sites, or the number of poorly sequenced individuals approaching 20%. We will examine this further below.
Here we’re going to finally run ANGSD to estimate genotype likelihoods. All outputs use the prefix gl_0.1.
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=angsd_GL_0.1
# set name of output file
#SBATCH --output=angsd_GL_0.1.out
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
FILTERS="-minInd 740 -uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -skipTriallelic 1 -SNP_pval 1e-6 -minMapQ 20 -maxDepth 14800"
DOS="-GL 1 -doMajorMinor 1 -doMaf 2 -doGlf 2 -doCounts 1 -doDepth 1"
$ANGSD -P 38 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc \
$FILTERS $DOS
Total number of sites analyzed: 80148766
Number of sites retained after filtering: 62571
Coverage
Of the 62571 sites, 47984 (76%) have more than 14800 reads (i.e. more than 10x coverage summed across all individuals)
Let’s take a look at a handful of individual per-site depth distributions:
## individual depth
idep<-read.table("./snp_calling/ANGSD_outputs/run_0.1/ALL.qc.depthSample")
#look at read depth histograms for a few individuals
#i100<-as.data.frame(t(idep[1,]))
#i100$depth<-c(0:14799, by = 1)
#colnames(i100)<-c("log10count", "depth")
#i100$log10count <- log10(i100$log10count)
#ggplot(data = i100, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,500))
i100<-as.data.frame(t(idep[100,]))
i100$depth<-c(0:14799, by = 1)
i100$log10count <- log10(i100$`1`)
colnames(i100)<-c("count", "depth", "log10count")
#ggplot(data = i100, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,500))
ggplot(data = i100, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,400))+ylim(c(0,4000))+labs(title = "individual 100")
i1000<-as.data.frame(t(idep[1000,]))
i1000$depth<-c(0:14799, by = 1)
i1000$log10count <- log10(i1000$`1000`)
colnames(i1000)<-c("count", "depth", "log10count")
#ggplot(data = i1000, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,1000))
ggplot(data = i1000, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,400))+ylim(c(0,2000))+labs(title = "individual 1000")
i500<-as.data.frame(t(idep[500,]))
i500$depth<-c(0:14799, by = 1)
i500$log10count <- log10(i500$`500`)
colnames(i500)<-c("count", "depth", "log10count")
#ggplot(data = i500, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,500))
ggplot(data = i500, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,300))+ylim(c(0,3000))+labs(title = "individual 500")
i148<-as.data.frame(t(idep[148,]))
i148$depth<-c(0:14799, by = 1)
i148$log10count <- log10(i148$`148`)
colnames(i148)<-c("count", "depth", "log10count")
#ggplot(data = i148, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,148))
ggplot(data = i148, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,300))+ylim(c(0,3000))+labs(title = "grandis individual")
Looking at a handful of per-individual depth distributions (reads per locus per individual), we observe a nice distribution with mostly >20x coverage.
Now that we have some genotype likelihoods, it is a good time for a sanity check. We quickly run and plot a PCA on the likelihoods to check if we observe the population genetic structure we expect.
a note on installation (code chunk below):
#didn't want to wait for the HPC to install pcangsd so did it locally. the problem (as always) is that pcangsd has dependencies that get installed to the local python but slurm calls a different python. used the tool virtualenv to get around this
#create the virtual environment in the directory containing pcangsd
virtualenv software/pcangsd/
cd software/pcangsd/
#activate it
source bin/activate
#install dependencies and pcangsd
pip install cython
pip install numpy
python setup.py build_ext --inplace
pip install -r requirements.txt
#check install
python pcangsd.py
#deactivate the virtualenv
#when running in a batch script, make sure to source this virtual env before executing the actual command
deactivate
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=pcangsd_0.1
# set name of output file
#SBATCH --output=pcangsd_0.1.out
source /home/ddayan/software/pcangsd/bin/activate
python /home/ddayan/software/pcangsd/pcangsd.py -beagle ../Results/ALL_gl_0.1.beagle.gz -o pcangsd.gl_0.1 -threads 38
Exploratory PCA results
cov_mat<-read.table("./pcangsd/pcangsd.gl_0.1.cov", sep = " ")
filenames = scan("./snp_calling/ANGSD_outputs/ALL.bamlist", what="character")
fish_names = gsub("/home/ddayan/fundulus/aligned/", "", gsub(".merged", "", gsub(".bam", "", filenames)))
pc1<-prcomp(cov_mat)
pc_inds<-as.data.frame(pc1$x)
row.names(pc_inds)<-fish_names
pc_inds$pop<-gsub("_\\d+", "", fish_names)
#caution: this deletes all NBH pops except NBH; fix later if we want to keep these outside the EDA
pc_inds <- merge(pc_inds, sites[,c("Abbreviation", "Latitude")], by.x = "pop" , by.y ="Abbreviation")
ggplot(data = pc_inds) + geom_point(aes(x = PC1, y = PC2, color = Latitude))+scale_color_gradientn(colours = rainbow(5))#+stat_ellipse(aes(x = PC1, y = PC2, group = pop),type = "norm")
Fig EDA PCA 1: Principal component analysis of genetic variation from the exploratory ANGSD run (gl_0.1). Populations colored by latitude.
plot_ly(x=pc_inds$PC1, y=pc_inds$PC2, z=pc_inds$PC3, type="scatter3d", mode="markers", color=pc_inds$pop)
Fig EDA PCA2: Principal component analysis of genetic variation from the exploratory ANGSD run (gl_0.1), including the 3rd PC. Hover the mouse over points for population ID; population codes are as in the table “sites” under the section “data description” at the top of the document.
Conclusions of EDA PCA
It looks like there are some QC issues to take on (or one grandis individual is more similar to F. heteroclitus than to the other grandis), but this approach is at least working well enough to differentiate populations in PC space. Also, PC1 dominates the PCA (75% of variance) and simply separates F. grandis from F. heteroclitus. PC2 (15%) separates populations within the southern clade (south of NJ) while PC3 (3%) separates northern clade populations.
This section estimates the final genotype likelihoods using ANGSD.
Below are the rationales underlying the filtering parameters used
grandis
Should we keep F. grandis in the GL estimation? We will use the grandis reads to create a grandis consensus sequence (-doFasta in ANGSD) to serve as the ancestral state for unfolded SFS estimation, but we will not estimate genotype likelihoods in grandis, as this would influence SNP probabilities, and including grandis dominates the structure found by PCA. There is also evidence from the PCA that a grandis individual was mislabeled in the field as a GA fish. This should be filtered out.
Hardy Weinberg Equilibrium
No HWE filtering; we filter out paralogues using excess coverage instead. The rationale here is that HWE is likely to be skewed given the amount of population structure in the sample, and filtering by HWE stratified by population is not possible in the ANGSD pipeline.
maf
MAF filtering will skew the SFS and should not be applied to the methods that rely on the SFS (demography inference, FST), but the inclusion of singletons in called-genotype datasets can confound STRUCTURE (with little effect on multivariate population structure algorithms, e.g. PCA and DAPC). My assumption is that using genotype likelihoods instead of called genotypes should reduce the impact of singletons or low frequency alleles (erroneous or true) on the inference of structure using model-based approaches like STRUCTURE, and that including rare alleles will improve discrimination among recently diverged populations (or populations with rare gene flow) when using multivariate approaches. See the ANGSD literature for more here.
max_coverage
Instead of relying on HWE to filter out paralogues, we use coverage. We set max coverage at 2.5x the modal coverage per locus; this means we will have to run ANGSD again with a higher maxdepth cutoff for the coverage output in order to estimate it.
min_coverage_locus
No hard minimum coverage cutoff; instead we use a stringent snp_pval cutoff and a minimum number of individuals (we require that a site be present in 50% of samples - 686)
minimum_coverage_ind
The mean coverage per individual (at the BAM level) is ~730k reads, but a handful of individuals have very low coverage (see bam stats section). These individuals are removed BEFORE the ANGSD analysis by editing the bam_lists; we removed individuals in the lowest 5% of the coverage distribution.
snp_pval
impose a stringent probability cutoff for retaining a SNP: 1e-6
quality_and_mapping
-uniqueOnly 1 -remove_bads 1 -C 50 -baq 1 -skipTriallelic 1 -minMapQ 20
individuals
The EDA also suggests that roughly 5% of individuals have extremely poor coverage; these are removed PRIOR to GL estimation.
This section contains final (gl_1.3) genotype likelihood estimation, but a detailed version history and the relevant code is also included below.
version history
1.0 first run
1.1 removed outlier high coverage
1.2 start from scratch with the mislabeled GA fish removed - GA_3318 clustered with grandis in PC space and also formed its own group in the PCA run on genotype likelihoods v1.1
1.3 coverage filtered output of 1.2
Make list of bam files, filtering out grandis and individuals in the lowest 5% of reads:
#get rid of individuals with low sequencing depth
bad_inds <- mapped_reads[mapped_reads$reads < quantile(mapped_reads$reads, 0.05), 2]
#also get rid of the mislabeled individual - pca_1.1 showed that "GA_3318" is most likely a grandis
bad_inds <- append(bad_inds, mapped_reads[140, 2])
#write.table(bad_inds, file = "./snp_calling/QC/bad_inds.txt", quote = FALSE, row.names = FALSE)
grep -v -f ./snp_calling/QC/bad_inds.txt ./snp_calling/ANGSD_outputs/ALL.bamlist | grep -v 'grandis' > ./snp_calling/QC/good_no_grandis_2.bamlist
This removed all grandis individuals, 72 individuals with too few reads, and 1 individual labeled as GA but actually grandis, leaving a final dataset of 1371 individuals.
Make genotype likelihood estimation v1.2
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=angsd_GL_1.2
# set name of output file
#SBATCH --output=angsd_GL_1.2.out
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
FILTERS="-minInd 686 -uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -skipTriallelic 1 -SNP_pval 1e-6 -minMapQ 20"
DOS="-GL 1 -doMajorMinor 1 -doMaf 2 -doGlf 2 -doCounts 1 -dumpCounts 2"
$ANGSD -P 38 -b ../meta_data/good_no_grandis_2.bamlist -ref $REF -out ./Results/gl_1.2 \
$FILTERS $DOS
Run 1.0: Total number of sites analyzed: 78663923; number of sites retained after filtering: 77717
Run 1.2: Total number of sites analyzed: 78419827; number of sites retained after filtering: 75430
Extract and plot coverage data
#check the format before running this
depths_loci<-read.table("./snp_calling/QC/gl_1.2.pos", header = T, sep = "\t")
#mode calculation
getMode <- function(x) {
keys <- unique(x)
keys[which.max(tabulate(match(x, keys)))]
}
modal_depth <- getMode(depths_loci$totDepth)
ggplot(data = depths_loci) + geom_histogram(aes(x = totDepth))+xlim(c(0,75000)) + geom_vline(aes(xintercept = modal_depth), color = "red") + geom_vline(aes(xintercept = 2.5*modal_depth), color = "blue")
Depth at loci estimated by ANGSD. Red line is the modal depth, blue line is 2.5X the modal depth
The modal depth per locus is 16023. All loci with read depth greater than 2.5x the mode are then filtered out.
#note, remove comments if need to write the outputs to file
#find loci with >2.5x the modal depth, then write out a loci list which will be used to filter the GL outputs (filtered datasets are named 1.3)
sites_to_keep<-subset(depths_loci, totDepth < 2.5*modal_depth)
#write.table(sites_to_keep[,c(1,2)], file = "./snp_calling/QC/sites_to_keep.txt", row.names = FALSE, quote = FALSE, col.names = FALSE)
#get bad sites and store in beagle call format
sites_to_remove<-subset(depths_loci, totDepth > 2.5*modal_depth)
sites_to_remove$beagle_id<-paste(sites_to_remove$chr, "_", sites_to_remove$pos, sep = "")
#write.table(sites_to_remove[,c(4)], file = "./snp_calling/QC/sites_to_remove.txt", row.names = FALSE, quote = FALSE, col.names = FALSE)
GL_1.2 had 75430 SNPs; after filtering out SNPs with greater than 2.5x the modal coverage (mode = 16023x), 73722 SNPs remain.
#filtering the datasets
zcat gl_1.2.beagle.gz > gl_1.2.beagle
awk 'FNR==NR { a[$1]; next } !($1 in a){print $0}' sites_to_remove.txt gl_1.2.beagle > gl_1.3.beagle
bgzip gl_1.3.beagle
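For reference, the awk one-liner above keeps any beagle row whose marker id (chr_pos, field 1) is absent from the removal list; the header passes through because "marker" never appears in that list. A minimal Python equivalent of the same filter (the toy rows below are hypothetical, not real gl_1.2 data):

```python
def filter_beagle(beagle_lines, remove_ids):
    """Keep beagle rows whose first field (marker id, e.g. chr_pos)
    is not in the removal list; the header line passes through
    because 'marker' never appears in the removal list."""
    remove = set(remove_ids)
    return [line for line in beagle_lines if line.split()[0] not in remove]

# toy data (hypothetical)
beagle = ["marker\tallele1\tallele2\tInd0",
          "chr1_100\t0\t1\t0.9",
          "chr1_200\t0\t2\t0.8"]
kept = filter_beagle(beagle, ["chr1_200"])  # drops the chr1_200 row
```

Like the awk version, this is a set-membership test per row, so it scales linearly with the beagle file size.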
First final results! This section details the population genetic structure of the species range using PCA (implemented in PCangsd) and NGSadmix, a Bayesian classifier similar to STRUCTURE.
The PCA relies on a covariance matrix estimated by PCangsd from the genotype likelihoods. This covariance matrix is then subjected to eigen-decomposition to produce the PCA of genetic distance among samples.
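The eigen-decomposition step can be sketched in Python with numpy (the 3x3 toy covariance matrix is hypothetical, standing in for the PCangsd output):

```python
import numpy as np

def pca_from_cov(cov):
    """Eigen-decompose a genetic covariance matrix into principal components.

    Returns per-sample PC scores (eigenvectors scaled by sqrt(eigenvalue))
    and the percent of variance explained, sorted in decreasing order.
    """
    cov = np.asarray(cov, dtype=float)
    vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(vals)[::-1]              # re-sort descending
    vals, vecs = vals[order], vecs[:, order]
    scores = vecs * np.sqrt(np.clip(vals, 0, None))
    pct_var = 100 * vals / vals.sum()
    return scores, pct_var

# toy 3-sample covariance matrix (hypothetical, not the real PCAngsd output)
cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.5, 0.2],
                [0.1, 0.2, 1.0]])
scores, pct = pca_from_cov(cov)
```

This is equivalent to what `prcomp` does on the covariance matrix below, up to sign conventions on the eigenvectors.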
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=pcangsd_1.3
# set name of output file
#SBATCH --output=pcangsd_1.3.out
source /home/ddayan/software/pcangsd/bin/activate
python /home/ddayan/software/pcangsd/pcangsd.py -beagle ../Results/gl_1.3.beagle.gz -o pcangsd.gl_1.3 -threads 38 -minMaf 0.0
PCANGSD run info:
Read 1371 samples and 73375 sites
Estimating population allele frequencies EM (MAF) converged at iteration: 12
Estimating covariance matrix Using 5 principal components (MAP test)
cov_mat<-read.table("./pcangsd/pcangsd.gl_1.3.cov", sep = " ")
filenames <- mapped_reads[! mapped_reads$sample %in% bad_inds,]
filenames <- filenames[!grepl("grandis", filenames$sample),]
filenames <- filenames[-c(1372, 1373),]
fish_names = gsub(".merged", "", filenames$sample)
pc2<-prcomp(cov_mat)
save(pc2, file = "./pcangsd/pca2.R")
pc_inds<-as.data.frame(pc2$x)
row.names(pc_inds)<-fish_names
pc_inds$pop<-gsub("_\\d+", "", fish_names)
# write.table(pc_inds, file = "./pcangsd/pca.txt", quote = FALSE) #commented to avoid overwriting, run if doing first run through analysis
#caution this deletes all nbh pops except nbh, fix later
pc_inds <- merge(pc_inds, sites[,c("Abbreviation", "Latitude")], by.x = "pop" , by.y ="Abbreviation")
ggplot(data = pc_inds) + geom_point(aes(x = PC1, y = PC2, color = Latitude))+scale_color_gradientn(colours = rainbow(5))#+stat_ellipse(aes(x = PC1, y = PC2, group = pop),type = "norm")
Results Fig. 1: Principal component analysis of genetic variation. Populations colored by latitude.
plot_ly(x=pc_inds$PC1, y=pc_inds$PC2, z=pc_inds$PC3, type="scatter3d", mode="markers", color=pc_inds$pop)
Results Fig. 2: Principal component analysis of genetic variation, including the 3rd PC. Hover mouse over points for population id - population codes are as in the table “sites” under the section “data description” at top of document.
screeplot(pc2)
Screeplot of PCA
PCA Summary
The final PCA corroborates previous descriptions of F heteroclitus population genetic structure. There is a break centered at northern New Jersey between two distinct clusters; within each cluster, a pattern of isolation by distance dominates the structure.
NGSadmix is used to estimate ancestry for K = 1-6 with 10 replicate runs per K, and the best K is determined by the Evanno method (not included in notebook - online app). Run details for publication-quality and other K plotting are in the “ngs_admix” directory.
Below is a structure plot of the best K according to Evanno method (k=2).
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=ngsadmix_1.3_k9
# set name of output file
#SBATCH --output=ngsadmix_1.3_k9
NGSadmix="/opt/angsd0.928/angsd/misc/NGSadmix"
BEAGLE="/home/ddayan/fundulus/ANGSD/Results/gl_1.3.beagle.gz"
Out_Prefix="1.3_k9"
K="9"
for REP in $(seq 1 10); do
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.$REP -P 10 -minMaf 0
done
#Plot admixture proportions for a single run of k = 2
#make popfile
pops<- as.data.frame(cbind(fish_names, gsub("_\\d+", "", fish_names)))
colnames(pops) <- c("sample", "pop")
#plot an admixture plot
q<-read.table("./ngs_admix/sandbox/outFileName.qopt")
#set order of populations by latitude
q$sample <- pops$sample
q$pop <- pops$pop
pops$pop <- factor( as.character(pops$pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "Mg", "CT", "SR", "BP", "HB", "Hb", "H", "F", "M", "Syc", "NBH", "P", "SLC", "ME", "MDIBL") )
ord <- order(pops$pop)
q <- q[ord,]
plot_data <- q %>%
mutate(id = sample) %>%
gather('cluster', 'prob', V1:V2) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assignment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assignment_prob)) %>%
ungroup() %>%
dplyr::arrange(factor( as.character(pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "Mg", "CT", "SR", "BP", "HB", "Hb", "H", "F", "M", "Syc", "NBH", "P", "SLC", "ME", "MDIBL") ))
#reorder levels
plot_data$pop <- factor( as.character(plot_data$pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "Mg", "CT", "SR", "BP", "HB", "Hb", "H", "F", "M", "Syc", "NBH", "P", "SLC", "ME", "MDIBL") )
#ggplot(plot_data, aes(id, prob, fill = cluster)) +
# geom_col() +
# theme_classic()
ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col() +
facet_grid(~pop, scales = 'free', space = 'free')+ theme(panel.spacing=unit(0.01, "lines"), axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(), legend.position = "none")
Data also formatted for clumpak visualization below
(for log in `ls *.log`; do grep -Po 'like=\K[^ ]+' $log; done) > logfile
logs <- as.data.frame(read.table("./ngs_admix/logfile_likelihoods.txt"))
logs$K <- rep(as.character(1:6), each = 10)
colnames(logs)[1] <- "likelihood"
write.table(logs[, c(2, 1)], "./ngs_admix/logfile_likelihood_formatted.txt", row.names = F,
col.names = F, quote = F)
Note - the delta K method suggests the best K is 2, but prob(K) is maximized at K = 6, suggesting runs at higher K may be necessary.
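The Evanno delta K statistic can be computed directly from the replicate log-likelihoods collected in the logfile above; a minimal Python sketch (the toy likelihood values and the 3-replicate layout are hypothetical):

```python
import statistics

def evanno_delta_k(likes_by_k):
    """Evanno et al. (2005) delta K from replicate log-likelihoods.

    likes_by_k maps K -> list of replicate log-likelihoods L(K).
    delta K = |mean L(K+1) - 2*mean L(K) + mean L(K-1)| / sd L(K),
    defined only for interior values of K.
    """
    ks = sorted(likes_by_k)
    mean = {k: statistics.mean(likes_by_k[k]) for k in ks}
    sd = {k: statistics.stdev(likes_by_k[k]) for k in ks}
    return {k: abs(mean[k + 1] - 2 * mean[k] + mean[k - 1]) / sd[k]
            for k in ks[1:-1]}

# toy log-likelihoods for K = 1..4 (hypothetical values, 3 replicates each)
likes = {1: [-1000, -1001, -999],
         2: [-900, -902, -901],
         3: [-890, -893, -889],
         4: [-887, -889, -886]}
dk = evanno_delta_k(likes)
best_k = max(dk, key=dk.get)   # K with the largest delta K
```

Note delta K cannot be evaluated at the smallest and largest K tested, which is one reason runs at higher K help.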
Admixture Results
The STRUCTURE plot above agrees with previous analyses. The primary aspect of population structure is the break north of New Jersey (i.e. the best K is 2 and splits the samples into groups north and south of the Hudson Basin). Also of note on the K=2 plot is evidence of introgression of northern ancestry into the admixture zone throughout New Jersey (populations SH, RB, OC, and MG). Surprisingly, there is some introgression of southern alleles into the southern extremes of the northern population in Connecticut and Rhode Island (CT and SR).
require(vegan)
require(ade4)
require(adespatial)
require(SoDA)
require(marmap)
Rationale:
Here the goal is to use multivariate genotype-environment association analysis to decompose the genetic variation along the cline into adaptive and neutral components, by finding genetic variation that correlates with putatively selective environmental variables after correcting for spatial autocorrelation. We use cRDA - redundancy analysis on components. Redundancy analysis is a multivariate constrained ordination that finds linear combinations of multiple explanatory variables (in this case temperature variation and spatial autocorrelation) that are “redundant with” (i.e. correlate with) linear combinations of the response variables (in this case genetic variation). In our case the response variable is not individual genotypes, but components of genetic variation among individuals - a PCA of the covariance matrix produced by PCangsd. This type of RDA - termed cRDA - is less powerful than a traditional RDA for finding outlier loci associated with the environmental variable, because individual SNPs must subsequently be identified as those correlated with the significantly constrained axes; i.e. the RDA is on the components of genetic variation, not directly on the SNPs. This approach is preferable for our purposes because (a) we are more interested in identifying the proportion and patterns of polygenic variation that may be due to selection and demography than in identifying the actual SNPs, and (b) conducting RDA on SNP-level variation would require calling SNPs rather than relying on the genotype probabilities used to calculate the covariance matrix.
The genetic variation (principal components) is modelled as an effect of two sets of variables: (1) spatial variables that capture neutral structure owing to isolation by distance and (2) environmental variables with a priori support that they drive adaptive evolution in this species. Further details on these variables appear in the relevant sections below.
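The core of a redundancy analysis can be sketched as a multivariate regression of the response matrix on the explanatory matrix, followed by a PCA (here via SVD) of the fitted values. A minimal numpy sketch with hypothetical toy matrices standing in for the genetic PCs and the db-MEM/temperature variables:

```python
import numpy as np

def rda(Y, X):
    """Minimal redundancy analysis: regress Y on X, then PCA of fitted values.

    Y: n x p response matrix (here, retained genetic PCs)
    X: n x q explanatory matrix (here, db-MEMs and temperature variables)
    Returns site scores on the constrained axes and the fraction of
    total variance in Y that is constrained by X.
    """
    Y = Y - Y.mean(axis=0)
    X = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)    # multivariate OLS fit
    Y_fit = X @ B                                # constrained (fitted) variation
    U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    site_scores = U * s                          # samples on constrained axes
    constrained_frac = (Y_fit ** 2).sum() / (Y ** 2).sum()
    return site_scores, constrained_frac

# toy data (hypothetical): response built mostly from the explanatory matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
Y = X @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(30, 5))
scores, frac = rda(Y, X)
```

vegan's `rda` additionally handles weights, conditioning variables, and permutation tests, but the constrained fraction here corresponds to the (unadjusted) R2 it reports.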
First we set up the dataset to run the RDA
Spatial Variables
The spatial variables are distance-based Moran’s eigenvector maps (db-MEMs). db-MEM is used to decompose the spatial structure into a set of orthogonal eigenfunctions, thereby providing better resolution of autocorrelation across spatial scales.
To calculate the db-MEMs we use a distance matrix based on travel distance among sites. That is, rather than using distance matrices based on latitude and longitude, we make a more biologically relevant model where spatial distance among populations is determined by connections via shallow water (<15 meters). This is accomplished via the marmap package in R.
plot(atlantic, image = TRUE, land = TRUE, lwd = 0.03, bpal = list(c(0, max(atlantic), greys),c(min(atlantic), 0, blues)))
points(sites[,c(3,2)], pch = 21, col = "blue", bg = col2alpha("blue", .9),cex = 1.2)
lapply(path, lines, col = "orange", lwd = 2, lty = 1)->dummy
Now that more ecologically relevant distances are calculated we estimate distance-based eigenvector maps
#prep site data for dbmem
#merge all NBH inds to a single pop and convert to all caps
pops <- pops %>%
mutate(pop = as.character(pop)) %>%
mutate(pop = replace(pop, pop == "F", "NBH")) %>%
mutate(pop = replace(pop, pop == "P" , "NBH")) %>%
mutate(pop = replace(pop, pop == "Syc" , "NBH")) %>%
mutate(pop = replace(pop, pop == "H", "NBH")) %>%
mutate(pop = replace(pop, pop == "Mg", "MG")) %>%
mutate(pop = replace(pop, pop == "Hb", "HB"))
ind_sites <- pops %>%
left_join(., sites, by = "pop")
distance_matrix <- lc.dist(trans, ind_sites[,c(4,3)], res = "dist")
#calculate dbmem and plot
dbmem_pos<-dbmem(distance_matrix, MEM.autocor = "positive")
plot(attributes(dbmem_pos)$values)
abline(h = 0)
Figure Caption: Plot of the db-MEMs’ Moran’s I values
The first 8 db-MEMs have Moran’s I values above 0 (i.e. they describe positive spatial autocorrelation) and are retained as explanatory variables in the RDA.
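Under the hood, db-MEM construction amounts to truncating the distance matrix, double-centering it as in a PCoA, and keeping the eigenvectors with positive eigenvalues. A simplified numpy sketch of that idea (adespatial's `dbmem` additionally selects eigenvectors by their Moran's I; the line-of-sites toy distances are hypothetical):

```python
import numpy as np

def dbmem(D, truncation):
    """Sketch of distance-based Moran's eigenvector maps (db-MEM).

    D: symmetric matrix of among-site travel distances.
    Distances beyond the truncation threshold are replaced by
    4 * truncation, the matrix is double-centered (as in PCoA),
    and eigenvectors with positive eigenvalues are returned -
    these model positive spatial autocorrelation.
    """
    D = np.asarray(D, dtype=float)
    Dt = np.where(D > truncation, 4 * truncation, D)
    A = -0.5 * Dt ** 2
    n = A.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = J @ A @ J                          # Gower double-centering
    vals, vecs = np.linalg.eigh(G)
    keep = vals > 1e-9                     # positive eigenvalues only
    order = np.argsort(vals[keep])[::-1]   # largest eigenvalue first
    return vecs[:, keep][:, order]

# toy distances among 5 sites on a line (hypothetical, not the marmap output)
coords = np.arange(5, dtype=float)
D = np.abs(coords[:, None] - coords[None, :])
mems = dbmem(D, truncation=1.0)
```

Because the eigenvectors come from a centered matrix, each MEM sums to zero across sites, which is why they combine cleanly with other explanatory variables in the RDA.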
Temperature
The environmental data is daily SSTs from the NOAA OI SST V2 High Resolution Dataset.
We extracted 10 years of daily optimum interpolated sea surface temperatures (OISSTs) from NOAA/OAR/ESRL PSD (Reynolds et al. 2007) for each of the sampling locations and calculated three derived environmental variables: mean temperature, mean annual minimum, and mean annual maximum. We then ran a PCA on these variables. Scripts for this data extraction are not included in this R notebook.
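The derivation of the three temperature variables can be sketched in Python, assuming a complete daily series of whole years (the sinusoidal toy series is hypothetical, not the OISST data):

```python
import numpy as np

def temperature_summaries(daily_sst, days_per_year=365):
    """Derive the three temperature variables from a daily SST series.

    daily_sst: 1-D array covering whole years (length = n_years * days_per_year).
    Returns (overall mean, mean annual minimum, mean annual maximum),
    where the annual extremes are averaged across years.
    """
    sst = np.asarray(daily_sst, dtype=float)
    years = sst.reshape(-1, days_per_year)   # one row per year
    return sst.mean(), years.min(axis=1).mean(), years.max(axis=1).mean()

# toy series: two sinusoidal years around a 15 degree mean (hypothetical)
t = np.arange(2 * 365)
sst = 15 + 10 * np.sin(2 * np.pi * t / 365)
mean_t, mean_min, mean_max = temperature_summaries(sst)
```

Averaging the per-year extremes (rather than taking the global extremes) is what makes the minimum and maximum "mean annual" quantities.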
#env variable for pops
env<-read.table("./environmental/temp_summary.txt", sep = "\t", header = T)
env <- env %>%
left_join(., pops, by = "pop")
#pca
env_pc <- prcomp(env[,c(2,3,4)], scale = TRUE)
biplot(env_pc)
plot(env_pc)
env <- cbind(env, env_pc$x)
Temperature variation PCA and associated screeplot
pairs.panels(env[,c(2,3,4,5,7,8,9)], scale = TRUE)
Comparing temperature variables
After all of that, we chose to keep only two environmental variables, mean_annual_min and mean_annual_max, because tenyearmean is strongly correlated with both of the other variables.
We use the covariance matrix calculated by PCangsd to conduct a PCA that describes the genetic variation among samples, then choose the number of PCs to retain using the Kaiser-Guttman criterion.
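The Kaiser-Guttman criterion itself is simple: keep every axis whose eigenvalue (variance) exceeds the mean eigenvalue. A minimal Python sketch with hypothetical toy eigenvalues:

```python
import numpy as np

def kaiser_guttman(eigenvalues):
    """Kaiser-Guttman criterion: retain axes whose eigenvalue (variance)
    exceeds the mean eigenvalue. Returns the number of axes to keep."""
    ev = np.asarray(eigenvalues, dtype=float)
    return int((ev > ev.mean()).sum())

# toy eigenvalues (hypothetical): the mean is 2.0, so two axes exceed it
k = kaiser_guttman([5.0, 3.0, 1.0, 0.6, 0.4])
```

Applied to the squared `sdev` values from `prcomp` below, this is what yields the 103 retained PCs.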
#pca
#to load pca: attach("./pcangsd/pca.r")
#snp_pca<-read.table("./pcangsd/pca.txt", header = T, sep= "\t")
#kaiser guttman: retain PCs whose variance exceeds the mean variance
cutoff <- mean(pc2$sdev^2)
kg <- sum(pc2$sdev^2 > cutoff)
#pcs to keep (kg = 103 here)
snp_pcs <- pc2$x[, 1:kg]
We keep the first 103 pcs to fit our RDA on
To make coding easier, merge explanatory vars into single dataset with varnames
ex_var <- cbind(dbmem_pos, env[,c(2,3)])
In this section we run and visualize the RDA results
First we run variable selection procedures and settle on the final model. Output for this section is too long to include in notebook, but stored in ./rda/rda_results_log.txt
#global rda
rda_null <- rda(snp_pcs ~ 1, data = ex_var)
rda_full <- rda(snp_pcs ~ . , data = ex_var)
#check that the full model is significant
anova(rda_full) # 0.001 yes it is significant - proceed to variable selection
#what the variance explained
adjR2.rda_full <- RsquareAdj(rda_full)$adj.r.squared # 76%
#variable selection
ordiR2 <- ordiR2step(rda_null, scope = formula(rda_full, data = ex_var))
ordiR2$anova #MEM7 not included in variable selection
ordi <- ordistep(rda_null, scope = formula(rda_full, data = ex_var))
rda.fs <- forward.sel(Y = snp_pcs, X = ex_var, adjR2thresh = adjR2.rda_full)
vif.cca(rda_full)
Then we fit the final model and test for significance of the constrained axes and of each explanatory variable on the margin. We also test the global significance of the models. Detailed outputs are available in ./rda/rda_results_log.txt
#final model
rda_final <- rda(snp_pcs ~ mean_annual_max + mean_annual_min + MEM1 + MEM2 + MEM3 + MEM4 + MEM5 + MEM6 + MEM7+ MEM8 , data = ex_var)
#testing significance of constrained axes (how many are interesting)
rda_axis <- anova.cca(rda_final, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust (rda_axis$`Pr(>F)`, method = 'holm')
#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <- anova.cca(rda_final, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust (rda_expl$`Pr(>F)`, method = 'holm')
#variance partitioning
vp <- varpart(snp_pcs , ex_var[,c(9,10)] , ex_var[,c(1:8)])
#controlling for spatial autocor
rda_ind_con_spatial<-rda(snp_pcs ~ as.matrix(env[,c(2)]) + Condition((as.matrix(dbmem_pos)[,c(1:8)])))
#controlling for temp
rda_ind_con_temp<-rda(snp_pcs ~ Condition(as.matrix(env[,c(2)])) + (as.matrix(dbmem_pos)[,c(1:8)]))
#anovas
aov_global <- anova(rda_final)
aov_con_spatial <- anova(rda_ind_con_spatial)
aov_con_temp <- anova(rda_ind_con_temp)
RDA Results Summary
We fit a global model of all variables. This full model was significant. We then performed forward variable selection using three approaches - ordistep (vegan), ordiR2step (vegan), and forward.sel (adespatial) - which yielded three different results. Ordistep (uses AIC) calls for the full model. OrdiR2step (uses adjusted R2 and p-value only) retains all MEMs and mean annual min. Forward.sel retains all MEMs and no temperature variables. Variance inflation factors are low for all explanatory variables (< 3.4). For the sake of ecological interpretability we retain both environmental variables - i.e. we went with AIC.
The first 6 RDA axes are significant (i.e. they explain significantly more variance than the equivalent unconstrained axes; Holm-adjusted p < 0.05). All explanatory variables are also significant (i.e. each variable explains significantly more variance once the data are conditioned on all other variables; Holm-adjusted p < 0.05).
Collectively these suggest that there is a significant relationship between the environmental temperature variable and the genetic covariance among samples, once the genetic data is conditioned on the effect of spatial autocorrelation AND vice versa. 76.5% of the total variation in the genetic dataset (or at least the first 103 of 1371 principal components of it) was constrained by the RDA.
Using variance partitioning, the relative effect of spatial autocorrelation and temperature could be estimated. Adjusted R2 for the effect of temperature, once spatial autocorrelation is accounted for, is 0.1%. In contrast 64.7% of the genetic dataset was explained by spatial autocorrelation alone. 11.4% of variation was explained jointly and could not be parsed. 23.7% was residual.
Examining the triplot suggests that the first db-MEM separates northern from southern populations at the New Jersey break.
RDA figures and results tables
## tri plot
#site score
rda_sum<-summary(rda_full, axes = c(6))
scores <- as.data.frame(rda_sum$sites)
scores$pop<-pops$pop
scores$pop <- factor( as.character(scores$pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "CT", "SR", "BP", "HB", "NBH", "P", "SLC", "ME", "MDIBL") )
#species scores
arrows<-as.data.frame(rda_sum$biplot)
#plot
ggplot()+geom_point(aes(RDA1, RDA2, color = pop), data = scores)+geom_segment(data=arrows, aes(x=0, y=0, xend=RDA1, yend=RDA2), arrow=arrow(length=unit(0.2,"cm")), color="red")+geom_text(data=arrows, aes(x=RDA1+RDA1*0.2, y=RDA2+RDA2*0.2, label=c("MEM1", "MEM2","MEM3","MEM4","MEM5", "MEM6","MEM7", "MEM8", "MaxTemp", "MinTemp")), size = 4, color="red")+theme_bw()+xlab("Redundant Axis 1 - 58%")+ylab("Redundant Axis 2 - 17%")
RDA Triplot
## screeplot
constrained_percent <- rda_full$CCA$eig/rda_full$tot.chi*100
unconstrained_percent <- rda_full$CA$eig/rda_full$tot.chi*100
barplot (c(constrained_percent, unconstrained_percent), col = c(rep ('red', length (constrained_percent)), rep ('black', length (unconstrained_percent))), las = 2, ylab = '% variation')
Screeplot of RDA - red is constrained axes, black is unconstrained axes
minor notes for deeper analysis if wanted
* if we conduct the RDA on all PCs (not just those exceeding the Kaiser-Guttman criterion) we can relate the RDA results more clearly to FST; i.e. we can see how much of the total variation along the cline is explained by the RDA and compare this to results from AMOVA and so on
* cRDA (RDA on the components of genetic variation) seems to be less powerful at identifying adaptive loci, largely because these loci can load onto many PCs. We could do a locus-by-locus RDA, but this would present its own challenges - something to consider
Work using the SFS is contained in a separate R notebook; all code below is draft for future work.
SFS Analysis Outline
* Make a consensus sequence from grandis reads to polarize the SAF (-doFasta 2)
* Estimate SAFs for each population and metapopulation (-doSAF)
* Estimate the per-population SFS
* Estimate the 2dSFS for each pair of populations
* Use the 1d and 2dSFS in downstream analyses: per-site summary statistics, global Fst, and dadi
Questions:
* Is it possible to estimate the SAF without re-estimating GLs (we already have them)?
* What filtering is necessary for the population and 2d SFS? When there are so few individuals, a MAF of 1% becomes meaningless; should we filter to a higher MAF, or at least increase the minimum number of individuals for a site to be retained (i.e. the SAF estimate is likely to be very, very bad at sites with only 2 individuals)?
* How should we pool sampling locales into metapopulations (i.e. it’s clear that all the NBH populations should be combined into a single pop, but what about all NJ populations or all ME populations)? What spatial or genetic scale should be the cutoff for pooling populations?
ancestral sequence
Polarize the SAF by an outgroup (F grandis): use -doFasta 2 with -doCounts 1 and the same filters as $FILTERS from the GL run.
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=consensus
# set name of output file
#SBATCH --output=consensus.out
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
FILTERS="-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -minMapQ 20"
DOS="-doFasta 2 -doCounts 1 -dumpCounts 2"
$ANGSD -P 38 -b ../meta_data/grandis.bams -ref $REF -out ./Results/grandis_consensus.fa \
$FILTERS $DOS
SAF 1 big population